CorpusExplorer: Supporting a Deeper Understanding of Linguistic Corpora
Authors
Abstract
Word trees are a common way of representing frequency information obtained by analyzing natural language data. This article explores their usage and possibilities, and describes the development of an application that visualizes the relative frequencies of 2-grams and 3-grams in Google's "English One Million" corpus, using a two-sided word tree and sparklines to show usage trends over time. It also discusses how the raw data was processed and trimmed to speed up access to it.
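The ideas named in the abstract can be illustrated with a minimal sketch, which is not the application's actual implementation. It assumes n-gram records of the form (ngram, year, match_count) and a table of per-year totals; the sample values, the function names `relative_series` and `two_sided_tree`, and the `min_count` trimming threshold are all hypothetical. The per-year relative frequencies are the kind of series a sparkline would plot, and the left and right context counts are the branches of a two-sided word tree.

```python
from collections import defaultdict

# Hypothetical sample records: (ngram, year, match_count).
# The values are invented for illustration and do not come from the corpus.
RECORDS = [
    ("deep water", 1900, 30),
    ("deep water", 1950, 20),
    ("deep sea", 1900, 10),
    ("the deep", 1900, 60),
    ("the deep", 1950, 40),
    ("very deep", 1950, 5),
]

# Hypothetical total 2-gram counts per year, used as the denominator.
YEAR_TOTALS = {1900: 1000, 1950: 2000}


def relative_series(records, year_totals):
    """Map each n-gram to {year: count / total count for that year}."""
    series = defaultdict(dict)
    for ngram, year, count in records:
        series[ngram][year] = count / year_totals[year]
    return dict(series)


def trim(counts, min_count):
    """Drop branches whose summed count is below min_count."""
    return {word: c for word, c in counts.items() if c >= min_count}


def two_sided_tree(records, root, min_count=0):
    """Collect the words directly left and right of `root` in 2-grams.

    Counts are summed over all years; rare branches are trimmed.
    """
    left = defaultdict(int)
    right = defaultdict(int)
    for ngram, _year, count in records:
        words = ngram.split()
        if len(words) != 2:
            continue  # this sketch handles 2-grams only
        first, second = words
        if second == root:
            left[first] += count
        if first == root:
            right[second] += count
    return {
        "root": root,
        "left": trim(left, min_count),
        "right": trim(right, min_count),
    }
```

Calling `two_sided_tree(RECORDS, "deep", min_count=10)` groups the sample 2-grams around the root word and discards branches seen fewer than ten times, which is one simple way to reduce the amount of data that has to be loaded.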
Similar Resources
An Open Linguistic Infrastructure for Annotated Corpora
Annotated corpora are a fundamental resource for research and development in the field of natural language processing (NLP). Although unannotated corpora (for example, Gigaword, Wikipedia, etc.) are often used to build language models, annotations for linguistic phenomena provide a richer set of features and hence, potentially better models in the long run. It is widely accepted that a first st...
A Cross-linguistic and Cross-cultural Study of Epistemic Modality Markers in Linguistics Research Articles
Epistemic modality devices are believed to be one of the prominent characteristics of research articles, the genre commonly used by members of the academic community. Considering the importance of such devices in producing and comprehending scientific discourse, this study aimed to investigate, cross-culturally and cross-linguistically, epistemic modality markers as an important subcategory...
Conceptualizing Sensory Relativism in Light of Emotioncy: A Movement beyond Linguistic Relativism
Given the significance of relativism in molding our worldview and uncovering the nature of truth, this study, using the newly developed concept of emotioncy, attempted to introduce sensory relativism as a new perspective according to which the senses can relativize our understanding of the world. To espouse the theory, 24 individuals were interviewed on their experiences...
Towards deeper understanding of the latent semantic analysis performance
The paper studies the factors influencing the performance of Latent Semantic Analysis. Unlike previous related research, which concentrates on parameters such as matrix element weighting, space dimensionality, and similarity measure, we address the impact of another fundamental factor: the definition of “word”. For this purpose, a series of experiments was performed on two corpora in order to...
Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank
Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools an...